Abstract

We investigate the sequence-dependent behaviour of localised excitations in a toy, nonlinear
model of DNA base-pair opening originally proposed by Salerno. Specifically we ask whether 
"breather" solitons could play a role in the facilitated location of promoters by RNA polymerase. 
In an effective potential formalism, we find excellent correlation between potential minima and 
Escherichia coli promoter recognition sites in the T7 bacteriophage genome. Evidence for a 
similar relationship between phage promoters and downstream coding regions is found and 
alternative reasons for links between AT richness and transcriptionally-significant sites are dis- 
cussed. Consideration of the soliton energy of translocation provides a novel dynamical picture 
of sliding: steep potential gradients correspond to deterministic motion, while "flat" regions,
corresponding to homogeneous AT or GC content, are governed by random, thermal motion. 
Finally we demonstrate an interesting equivalence between planar, breather solitons and the 
helical motion of a sliding protein "particle" about a bent DNA axis. 
1 Introduction



Protein-DNA interactions play many of the fundamental roles in gene regulation. An under- 
standing of the mechanisms involved in these processes is one of the major current goals for 
numerous biological sciences. With large repositories of genetic information available - and costs 
associated with difficult, highly specific experiments - the question of how well such molecular 
interactions can be simulated is clearly important to investigate. Enzymes and many transcrip- 
tional factors are proteins, often composed of tens to hundreds of amino acids, while the DNA 
domains to which they bind can contain, in the case of prokaryotes, up to 10^ nucleotide bases. 
All-atom modelling of such a molecular complex, even neglecting the key roles of hydration and 
ions, is beyond current computational ability. 

An alternative, logical first step is to consider one specific kind of interaction, focussing 
exclusively on its salient features and develop an accordingly simplified model. In this spirit, 
simple dynamical models of DNA have been studied for almost two decades, most successfully in 
regard to describing denaturation experiments (see Ref. PP for a review). The process by which 
the motions of small DNA molecules containing ~ 10^ atoms are "coarsened" to the nucleotide 
base-pair level has previously been qualitatively argued. Typically the DNA molecule is
modelled by one degree of freedom per base-pair: a radial "stretching" [H] or a pendulum-like 
base "flipping" (Ref. [3] being a recent review). 

We note, in this context, that ab initio calculations of small DNA oligomers [5] suggest
that base-pair motions can be accurately approximated at the dinucleotide level in terms of 
two or three quasi-rigid, internal degrees of freedom, lending some weight to the coarse-graining 
assumptions previously made. The motivation for these simple models is that, if experimental 
results can be described with a small number of degrees of freedom, then these degrees of freedom 
must be the dominant ones for the process in question. The applications of such a model are 
necessarily restricted to extremely specific instances of DNA behaviour [2]. Given that many
regulatory processes are governed by highly-specific, localised denaturation of the DNA helix, it 
is logical to investigate sequence-dependent dynamical behaviours in such a setting. 

In 1991 Salerno [H| proposed a base-flipping model of the bacterial promoter DNA sequence
A1 in the T7 genome, suggesting the sequence had special, "dynamically active" qualities with
regard to propagating kink solitons. Subsequent investigations of other host-specific promoter
T7 sequences [Z]-[2] made similar findings. Moreover in Ref. [HI it was suggested that solitons
could be created as conformational changes to the DNA helix due to DNA-RNAP interactions.
Recently the propagation of kinks through the entire T7 genome sequence has been studied |lf)),
although the (significant) differences between host- and phage-specific promoter sequences were
neglected. Another paper investigated whether kinks might propagate differently in coding
and non-coding sequences.

Solitons had previously been suggested [12] to have a role in DNA transcription in the
1980s. However, at least one picture which developed ^Hj), of an RNA polymerase (RNAP)
molecule "surfing" a thermally-driven region of open base pairs, is inconsistent with the known
conformational changes of RNAP and DNA which occur during open complex formation. Secondly,
no absorption resonances have been observed in microwave spectroscopy of DNA [14],
[15]. Crucially, the original motivation for invoking solitons, namely anomalously long lifetimes of DNA
base-pair openings, was shown to result from misinterpretation of data. DNA solitons were
vigorously dispelled by some researchers [17], with the result that they remain little more than
a curiosity outside of the nonlinear physics community.

More generally, a variety of studies jTH], [20] have suggested connections between the
base-sequence dependence of helical thermal stability and transcriptional regulation sites. The
common feature in all these studies, soliton models included, is that AT-rich regions are
distinguished by way of reduced thermal stability relative to GC-rich regions. Of course regions
rich in AT might stand out from a regulatory perspective for geometric, mechanical or chemical
reasons. For example, tracts of A or T nucleotides carry intrinsic curvature and confer
rigidity to the DNA superstructure -[23], leading to significant departures from the average
B-DNA form. Finally the role of counterions in determining DNA structure, such as bending and
groove width |2l], and their affinity for such tracts is not yet fully understood. Sequence-dependent
variations in the electrostatic surface of DNA may also present a unique "signature" in promoter
regions [25].

It should be emphasised that the solitons we consider have nothing to do with thermally- 
driven, transient base-pair openings, the source of the original controversy. Protein-DNA in- 
teractions involve conformational changes of the DNA helix and in our opinion it is logical to 
investigate, if a small change might be modelled by a structural perturbation to a regular B- 
DNA helix, whether its translocation might be approximated by a soliton propagating through 
a nonlinear medium. 

The structure of the paper is as follows: We briefly discuss aspects of protein-sliding and 
review the mechanism of lytic infection of Escherichia coli by T7 bacteriophage. Assuming DNA 
molecules can actually support nonlinear, quasi-solitonic excitations like those of the simple 
"base-flipping" model [HI, we discuss the kind of biological roles they might serve. To this end 
we introduce the inhomogeneous Frenkel-Kontorova (IFK) model [SHI, the basis of Salerno's 
approach HI-IHl! and its breather soliton solutions. The propagation of breather solutions is 
analysed via an effective potential formalism and we compute the energy landscape, comparing 
extrema with bacterial and phage promoters of the T7 genome. Supposing then, that solitonic 
excitations of DNA do not exist, we discuss alternative reasons why the correlations obtained 
between regulatory features and potential extrema might have been obtained. In particular 
a novel equivalence between base-flipping stability in the planar IFK model and bending in a 
3-dimensional, helical model is outlined. 

1.1 Protein sliding 

The mechanisms by which regulatory proteins, such as RNAP, can recognise their specific binding 
sites among tens, or even hundreds of thousands, of structurally identical nonspecific sites on 
a DNA strand are generally not well understood. A widely accepted hypothesis is that many 
proteins have several modes [27] with which to bind to DNA:

Nonspecific binding occurs when a polar domain of the protein displaces cations in the
major/minor grooves of the DNA helix. The protein effectively slides, one-dimensionally, along
the groove through a series of nonspecific binding events. The translocation mechanism for the
nonspecific complex is not known, but is ATP-independent j25j and widely assumed to be driven
by thermal motion. Other possible diffusion modes include "hopping", where a nonspecifically
bound protein dissociates and re-associates within the same DNA domain. For some proteins,
such as lac repressor j2El ), a possibility for transfer between sequentially-distant regions of DNA
brought close together in 3-dimensional space also exists. In general, the facilitated location of
an operator by a protein is likely to consist of a sequence of sliding, hopping or intersegmental
transfer events [57]. A growing body of evidence exists that many regulatory proteins, such as
repressors jJH], [22], RNAPs |i3n]-[34], nucleases [SE] and methylases ISE], locate their operator
sites in this way.

While the sliding component of facilitated target location is commonly assumed to be thermally
driven, we point out there is no experimental evidence to preclude dynamical effects
resulting from local, sequence-dependent mechanical properties of DNA. In the seminal paper of
Berg and co-workers [S7j there are two assumptions, in particular, which may be unsatisfactory.
Firstly the facilitated transport model is derived under the assumption of a homogeneous, free
protein distribution. While such conditions can be arranged in vitro this is not the case for a
biological system. Secondly there is no real account of degrees of molecular recognition: operator
sites are treated as "sinks", defining a boundary condition for the one-dimensional diffusion
equation. It is highly probable that some kind of "reading" process also occurs, mediated by
the electrostatic interactions between protein residues and nucleotide functional groups lying in
one, or both, of the grooves.

The limited empirical studies of sliding RNAP performed to date, for example Refs. |3Uj-[34],
invariably average behaviour over many individual sliding events, masking any sequence-dependent
variation which might exist. On the other hand a recent model of the hypothetical
reading process [HH] was built upon the assertion that

...the protein should follow a noise-influenced, sequence-dependent motion that in- 
cludes the possibility of slowing down, pauses and stops... 

Let us qualitatively envisage then, how soliton-like deformations might arise in RNAP-DNA
interactions. The presence of enzymes as "mass defects" in a 1-dimensional DNA model has previously
been considered [SHI ) 001 1 [13 with regard to thermal breathers and transcribing RNAP.
In distinction, initial binding to a nonspecific DNA site entails insertion of polymerase domains
into the major groove of the helix, where the displacement of counterions occurs. Suppose that
the initial contact and recoil of RNAP during association induces a localised deformation in the
B-DNA helix, which we approximate by a breather soliton. Breather excitations in the IFK
model [26] are, due to the discreteness of the model, inherently unstable. Therefore the initially
stationary breather would propagate along the strand, preferentially in a direction determined
by local inhomogeneities in the base sequence. Further, the deformation is not a true soliton,
owing to the discreteness of the DNA lattice, and radiates energy, eventually dissipating. In this
regard we also note that the mean sliding distances of RNAPs are known to be highly sensitive to
variations in cation concentration [31], [32].

There are two pictures which are plausibly consistent with noisy, deterministic dynamics: 
either the RNAP can effectively "surf" the breather, or randomly moving RNAP and deterministically
travelling breathers can interact somehow on collision. Regarding the latter, little
is known about the structure of nonspecific protein-DNA complexes and for the remainder of 
the paper we consider the former, less speculative, scenario. 

1.2 T7 bacteriophage 

Bacteriophage T7 is a member of the Podoviridae family of viruses, which cause lytic infection of
bacteria. Its simple regulatory apparatus is one of the most widely studied, serving as a model for 
genomes of more complex organisms. As mentioned above, previous nonlinear DNA studies 0- 
[H], HO] have involved the T7 phage genome sequence. However these studies focussed exclusively 
on the base sequence, with no consideration for the possible changing biological context of the 
information it contains. In fact the essentially linear processes of T7 DNA translocation and 
gene expression make this phage an excellent case study. 

T7 is known to inject its double-stranded, linear DNA into a host E. coli cell in a stepwise,
transcription-dependent manner. The T7 genome contains 39,937 base pairs but initially
only the first 850 of these bases are translocated from the phage particle |3S]. This initial fragment
contains three strong promoters specific to E. coli RNAP, A1-A3 (in addition to the minor A0, or
D, promoter with no known in vivo function), which initiate transcription of the phage sequence.
The remainder of the genome is divided into three sections: the "early" region contains class I
genes, those responsible for modifying host metabolism to favour phage production; the "middle"
region, where class II genes govern phage replication; and the "late" region of class III genes driving
maturation and packaging of newly assembled phage DNA strands.

Transcription of the initial fragment serves a dual function: "pulling" downstream, early 
DNA from the phage particle into the host cell, in addition to transcribing the class I genes. 
The product of the first of these, gene 0.3, inactivates host defence (specifically type I restric- 
tion/modification) systems, therefore rapid recognition of a major promoter is vital for success- 
ful infection by wild-type T7. Another product of this early region is a T7 RNA polymerase, 
recognising its own specific promoters, which is responsible for transcribing the remainder of 
the genome, a process which proceeds in two steps: Entry of the middle region into the host 
cell is dependent upon the successful translation of class I genes. In turn, translocation of the 
late-transcribed region requires the products of early and mid-regions. 

In contrast to gene expression in more complex organisms, there are very few "feedback"
loops; indeed a virtual simulation of the T7 life cycle has been developed [42]. There are two
known loops which may have relevance to our analysis below: mid-late inhibition of class I
(host-specific) promoters [43] and late inhibition of the mid (phage-specific) promoters [44].

2 The model 

At physiological temperatures the physically dominant mode of base-pair opening is the base-flipping,
pendular oscillation of bases about their N-glycosidic bond in the mean base-pair
plane. Such models previously considered for biological roles [n]-^^ are based upon the IFK
[26] Hamiltonian:

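$$H = \sum_{i=1}^{n} \frac{I_i}{2}\left(\dot{\theta}_i^2 + \dot{\psi}_i^2\right) + \sum_{i=1}^{n-1} \frac{1}{2}\left[K_i(\theta_{i+1}-\theta_i)^2 + \bar{K}_i(\psi_{i+1}-\psi_i)^2\right]$$
$$\qquad + \sum_{i=1}^{n} \alpha_i\left(1-\cos(\theta_i-\psi_i)\right). \qquad (1)$$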

Here $\theta_i$, $\psi_i$ are the angles of deflection of the $i$th base "pendulum" and that of its complement
from equilibrium, while $I_i$ is the inertial moment. Nearest-neighbour bases are coupled by a
harmonic torsion potential with "stiffnesses" $K_i$, $\bar{K}_i$. Finally $\alpha_i$ is the characteristic strength of
the nonlinear H-bonding potential between complementary bases.

In earlier studies [Hl-IHl homogeneous inertial moments and stiffness constants were assumed,
with the only sequence-dependence residing in the H-bonding coupling constants $\alpha_i$. Specifically,
for $i = 1, \ldots, n$

$$I_i = I, \qquad K_i = K = \bar{K}_i.$$

In addition it was assumed that $\alpha_i = \lambda_i k$, where $k$ is a generic coupling and $\lambda_i$ takes the values
2 and 3 for A.T and G.C pairs respectively, accounting for the differing numbers of hydrogen bonds.
With these approximations, one passes to angle sum- and difference-coordinates 
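The Hamiltonian thus obtained is

$$H = \sum_{i=1}^{n} \frac{I}{2}\left(\dot{u}_i^2+\dot{v}_i^2\right) + \sum_{i=1}^{n-1}\frac{K}{2}\left[(u_{i+1}-u_i)^2+(v_{i+1}-v_i)^2\right]$$
$$\qquad + \sum_{i=1}^{n}\lambda_i k\,(1-\cos u_i). \qquad (2)$$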

The equations of motion for the $u_i$ reduce to the set of dimensionless, coupled equations:

$$\ddot{u}_i - (u_{i+1} - 2u_i + u_{i-1}) + \beta_i \sin u_i = 0; \qquad u_i = \theta_i - \psi_i, \qquad (3)$$

where the time variable has been rescaled, $t \to \sqrt{K/I}\,t$, and the parameter $\beta_i = \lambda_i k/K = \lambda_i \eta$. To
model the sequence variation as small perturbations to a homogeneous solution, we first require
the average value of the parameters $\beta_i$:

$$\bar{\beta} = \left(2\frac{n_{AT}}{n} + 3\left(1-\frac{n_{AT}}{n}\right)\right)\eta, \qquad (4)$$

where there are $n_{AT}$ occurrences of A.T pairs in the molecule. In a purely homogeneous approximation,
$\beta_i \to \bar{\beta}$, in the continuum limit the system of equations (3) reduces to the sine-Gordon
equation,

$$\ddot{u} - u'' + \bar{\beta} \sin u = 0, \qquad (5)$$

which has a rich variety of solitonic solutions. A family of "breather" solutions of Eq. (5) with
lengths $L_\mu$ and internal frequencies $\omega_\mu$ is defined by

$$u_{br}(x,t) = 4\tan^{-1}\left(\frac{\sin(\omega_\mu t)}{\omega_\mu L_\mu}\,\mathrm{sech}\left(\frac{x}{L_\mu}\right)\right), \qquad (6)$$

where in terms of the classifying parameter $\mu$

$$L_\mu = \bar{\beta}^{-1/2}\,\mathrm{cosec}\,\mu, \qquad \omega_\mu = \bar{\beta}^{1/2}\cos\mu.$$

Note that the above relation imposes a minimum breather width and frequency which the model
can support for a given set of environmental conditions. The smallness of $\bar{\beta}$ ensures that an
approximate solution of the discrete, inhomogeneous model is of a similar form with slowly-varying
parameters, thus our ansatz for Eq. (3) is

$$u_n = 4\tan^{-1}\left(\frac{\sin(\omega t)}{\omega L}\,\mathrm{sech}\,z_n\right); \qquad z_n = (n-X)/L. \qquad (7)$$

Here $X$ is understood as a collective coordinate for the breather and, for convenience, we have
omitted the $\mu$ subscript.
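The relations above can be illustrated numerically. The following is a minimal sketch, not the authors' code: the function names, the 200 bp test sequence and the choice $\mu = \pi/6$ are assumptions made for illustration. It evaluates the mean coupling of Eq. (4), the breather width and frequency for a given $\mu$, and the ansatz (7) at maximum opening, where the peak amplitude should equal $4\mu$.

```python
import numpy as np

# Hydrogen-bond weights lambda_i quoted in the text (2 for A.T, 3 for G.C).
LAMBDA = {"A": 2.0, "T": 2.0, "G": 3.0, "C": 3.0}


def mean_beta(seq, eta):
    """Average coupling of Eq. (4): eta times the mean of lambda_i."""
    return eta * float(np.mean([LAMBDA[base] for base in seq]))


def breather_params(beta_bar, mu):
    """Width L and internal frequency omega for family parameter mu."""
    length = 1.0 / (np.sqrt(beta_bar) * np.sin(mu))
    omega = np.sqrt(beta_bar) * np.cos(mu)
    return length, omega


def breather_ansatz(n_sites, centre, t, length, omega):
    """Ansatz of Eq. (7) evaluated on sites 1..n_sites."""
    z = (np.arange(1, n_sites + 1) - centre) / length
    return 4.0 * np.arctan(np.sin(omega * t) / (omega * length) / np.cosh(z))


seq = "ATGC" * 50                      # hypothetical 200 bp test sequence
beta_bar = mean_beta(seq, eta=2e-3)
length, omega = breather_params(beta_bar, mu=np.pi / 6)
# Profile at maximum opening, a quarter of an internal period after t = 0.
u = breather_ansatz(len(seq), centre=100, t=np.pi / (2 * omega),
                    length=length, omega=omega)
print(f"beta_bar={beta_bar:.4f}  L={length:.1f} bp  peak={u.max():.3f} rad")
```

Note that the product $\omega L = \cot\mu$ is independent of $\bar{\beta}$, which is why the amplitude of the ansatz depends on $\mu$ alone.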

If the total energy is approximately conserved, upon substituting (7) into the Hamiltonian
(2) one arrives at an expression for the effective potential in the collective coordinate $X$ [S],
associated with the propagation of the initial excitation on an inhomogeneous background. We
find, using the identity

$$1 - \cos u = 8/\left(\tan(u/4) + \cot(u/4)\right)^2,$$

the expression for the total energy takes the form

$$E(X;t) = K(X;t) + V(X;t) + O(\beta^2) = 0. \qquad (8)$$
Here

$$K(X;t) = \frac{8}{L^2}\sum_{i=1}^{n} D_i(X;t)\left(a(t)\sinh^2 z_i\,\dot{X}^2 - \omega L\sqrt{a(t)\left(\frac{1}{\omega^2L^2}-a(t)\right)}\,\sinh 2z_i\,\dot{X}\right),$$

$$V(X;t) = 8\sum_{i=1}^{n} D_i(X;t)\left(\left(\frac{1}{L^2}-\omega^2 a(t)\right)\cosh^2 z_i + a(t)\left(\frac{1}{L^2}\sinh^2 z_i + \beta_i\cosh^2 z_i\right)\right),$$

$$D_i(X;t) = \frac{1}{\left(a(t)+\cosh^2 z_i\right)^2},$$

where the function $a(t) = (\sin(\omega t)/\omega L)^2$ governs the time-dependence of $V$. If the breather
oscillation timescale is typically orders of magnitude smaller than that of its propagation along
the DNA |1J we can replace the time-dependent potential by its average value [47]:



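$$V_{av}(X) = \frac{\omega}{2\pi}\int_0^{2\pi/\omega} V(X;t)\,dt = \frac{4}{L^2}\sum_{i=1}^{n}\frac{\left(\sec^2\mu+\beta_iL^2\tan^2\mu\right)\cosh z_i}{\left(\tan^2\mu+\cosh^2 z_i\right)^{3/2}}. \qquad (9)$$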
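The time average can be checked with two standard integrals. The following is a sketch in our own shorthand, not the authors' notation: $c_i = \cosh^2 z_i$, $A = \tan^2\mu$ and $\tau = \omega t$ are introduced here for brevity, and $\langle\cdot\rangle$ denotes the average over one period in $\tau$. It uses $\omega L = \cot\mu$, so that $a(t) = A\sin^2\tau$ and $\omega^2 a(t) = \sin^2\tau/L^2$. Both averages follow from $\langle (c + A\sin^2\tau)^{-1}\rangle = [c(c+A)]^{-1/2}$ by differentiating with respect to $A$ and $c$.

```latex
\begin{align*}
V(X;t) &= \frac{8}{L^2}\sum_i
  \frac{c_i\cos^2\tau + A\sin^2\tau\,\bigl(c_i - 1 + \beta_i L^2 c_i\bigr)}
       {(c_i + A\sin^2\tau)^2},\\[4pt]
\Bigl\langle \frac{\sin^2\tau}{(c+A\sin^2\tau)^2}\Bigr\rangle
  &= \frac{c}{2\,[c(c+A)]^{3/2}},\qquad
\Bigl\langle \frac{\cos^2\tau}{(c+A\sin^2\tau)^2}\Bigr\rangle
  = \frac{c+A}{2\,[c(c+A)]^{3/2}},\\[4pt]
V_{av}(X) &= \frac{4}{L^2}\sum_i
  \frac{c_i\,(c_i + A) + A\,c_i\,(c_i - 1 + \beta_i L^2 c_i)}{[c_i(c_i+A)]^{3/2}}
  = \frac{4}{L^2}\sum_i
  \frac{(1 + A + A\beta_i L^2)\,c_i^{1/2}}{(c_i + A)^{3/2}} .
\end{align*}
```

Since $1 + A = \sec^2\mu$ and $c_i^{1/2} = \cosh z_i$, the last line is the time-averaged potential in terms of $\mu$, $L$ and $\beta_i$.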

Owing to the nonlinear nature of the model the energy to translocate the initial deformation is
several orders of magnitude less than that required to create the breather initially. We can derive a
simple estimate of the "noisiness" of the sliding dynamics from the energy required to shift the
breather by one base-pair:

$$\epsilon(X) = \frac{1}{k_BT}\left(V_{av}(X_{i+1}) - V_{av}(X_i)\right), \qquad (10)$$

where $V_{av}$ is the time-averaged potential (9) and $k_B$ is Boltzmann's constant. For steep gradients
the picture of sliding RNAP is thus analogous to a particle moving through an energy landscape,
while in flat regions it is more akin to a random walk.
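The landscape and noise parameter can be computed directly from a base sequence. The sketch below is illustrative only: the function names, the summation window, the conversion factor `scale_over_kt` and the synthetic sequence (an AT-rich block in GC-rich flanks) are all assumptions, not part of the original study. It evaluates the time-averaged potential, $\frac{4}{L^2}\sum_i(\sec^2\mu+\beta_iL^2\tan^2\mu)\cosh z_i/(\tan^2\mu+\cosh^2 z_i)^{3/2}$, at a set of breather centres and takes site-to-site differences as the noise parameter.

```python
import numpy as np

LAMBDA = {"A": 2.0, "T": 2.0, "G": 3.0, "C": 3.0}


def family_parameter(seq, eta, length):
    """mu from cosec(mu) = L * sqrt(beta_bar), with beta_bar as in Eq. (4)."""
    beta_bar = eta * np.mean([LAMBDA[base] for base in seq])
    return float(np.arcsin(1.0 / (length * np.sqrt(beta_bar))))


def averaged_potential(seq, eta, length, mu, centres, window=20.0):
    """V_av(X) of Eq. (9), in units of the stiffness K, at each centre X.

    Only sites within `window` breather widths of X are summed; the
    weight decays like exp(-2|z|), so the truncation error is negligible.
    """
    beta = eta * np.array([LAMBDA[base] for base in seq])
    n = len(seq)
    tan2 = np.tan(mu) ** 2
    sec2 = 1.0 + tan2
    half = int(np.ceil(window * length))
    v_av = np.empty(len(centres))
    for j, x in enumerate(centres):
        lo = max(1, int(x) - half)          # first site (1-based)
        hi = min(n, int(x) + half)          # last site (1-based)
        z = (np.arange(lo, hi + 1) - x) / length
        c = np.cosh(z)
        weight = c / (tan2 + c**2) ** 1.5
        v_av[j] = 4.0 / length**2 * np.sum(
            (sec2 + beta[lo - 1:hi] * length**2 * tan2) * weight)
    return v_av


def noise_parameter(v_av, scale_over_kt=1.0):
    """Eq. (10): potential difference between neighbouring sites.

    `scale_over_kt` converts the dimensionless potential to units of
    k_B T; its value is an assumption to be supplied by the user.
    """
    return scale_over_kt * np.diff(v_av)


# Hypothetical sequence: an AT-rich block embedded in GC-rich flanks.
seq = "G" * 300 + "A" * 60 + "G" * 300
eta, length = 2e-3, 30.0
mu = family_parameter(seq, eta, length)
centres = np.arange(100, 561)
v_av = averaged_potential(seq, eta, length, mu, centres)
eps = noise_parameter(v_av)
print("minimum of V_av at site", centres[np.argmin(v_av)])
```

For this synthetic sequence the minimum falls inside the AT-rich block, and for a homogeneous sequence the landscape is flat away from the ends, in line with the qualitative picture described above.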



3 Results 

Having derived the breather effective potential (9) we compute the "landscape" corresponding
to the T7 genome. It is natural, initially, to assume breather width is the size of a nonspecific
RNAP complex. This size is not directly known, either for E. coli or T7 RNAP. For certain
other proteins the size of a nonspecific complex is estimated to be several times smaller |3H1
than that of a specific one. Therefore we assume upper bounds on $L$ are provided by the DNA
"footprint" size protected by RNAP in nuclease digestion experiments. For translocating E. coli
and T7 elongation complexes these values are $L$ = 30 bp HHI and $L$ = 24 bp respectively.

3.1 Sequence Analysis 

For our initial sequence analysis we adopt model parameter values coinciding with those of
Salerno's [0] original study of T7 promoters. Setting the ratio $\eta = 2 \times 10^{-3}$ implies the lower
bound for breather width is $L_{min} = \bar{\beta}^{-1/2} \approx 15$ bp. Figure 1a shows the region of $V_{av}$ corresponding
to the initial 850 bp fragment of the T7 phage for a breather of width 30 bp. For
comparison Figure 1c shows the time evolution of the system (3) with breathers initially placed
at intervals within the fragment. Comparison of the three trajectories with the effective potential
landscape in Fig. 1a serves to verify that the direction and range of propagation agree for
the two methods.
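A direct integration of the discrete equations of motion, $\ddot{u}_i = (u_{i+1} - 2u_i + u_{i-1}) - \beta_i \sin u_i$, can be sketched as follows. This is not the integrator used in the paper, which is not specified: the velocity-Verlet scheme, the free-end boundary condition, the random sequence, the time step and the function names are all assumptions made for illustration. The breather is launched from the ansatz at $t = 0$, where $u_i = 0$ and $\dot{u}_i = 4\,\mathrm{sech}(z_i)/L$.

```python
import numpy as np


def acceleration(u, beta):
    """Right-hand side of Eq. (3) with free ends: lattice Laplacian - beta*sin(u)."""
    lap = np.empty_like(u)
    lap[1:-1] = u[2:] - 2.0 * u[1:-1] + u[:-2]
    lap[0] = u[1] - u[0]
    lap[-1] = u[-2] - u[-1]
    return lap - beta * np.sin(u)


def energy(u, v, beta):
    """Dimensionless energy conserved by Eq. (3)."""
    return (0.5 * np.sum(v**2) + 0.5 * np.sum(np.diff(u) ** 2)
            + np.sum(beta * (1.0 - np.cos(u))))


def integrate(u, v, beta, dt, n_steps):
    """Velocity-Verlet integration of the coupled pendulum chain."""
    a = acceleration(u, beta)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * a
        u = u + dt * v_half
        a = acceleration(u, beta)
        v = v_half + 0.5 * dt * a
    return u, v


# Hypothetical random sequence; eta and the breather width are illustrative.
rng = np.random.default_rng(0)
lam = rng.choice([2.0, 3.0], size=400)       # 2 for A.T, 3 for G.C
beta = 2e-3 * lam
length, centre = 30.0, 200.0
z = (np.arange(1, beta.size + 1) - centre) / length
u0 = np.zeros(beta.size)                      # ansatz (7) at t = 0
v0 = 4.0 / (length * np.cosh(z))              # its time derivative at t = 0
e0 = energy(u0, v0, beta)
u1, v1 = integrate(u0, v0, beta, dt=0.05, n_steps=4000)
e1 = energy(u1, v1, beta)
print(f"relative energy drift: {abs(e1 - e0) / e0:.2e}")
```

Monitoring the energy drift is a simple sanity check on the time step; tracking the position of the maximum of $|u_i|$ over time would give a trajectory to compare with the effective-potential landscape.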

Now the $\sigma^{70}$ subunit of E. coli holoenzyme RNAP recognises hexamers located 35 and 10
bases upstream of transcription initiation [51]. In addition many strong promoters are enhanced
[52] by a UP element: contacts between the $\alpha$ subunit of RNAP and AT-rich sequences centred
approximately 40-60 sites upstream. Inspection of Figure 1a shows that the UP region of A1
and the -35 sites for A2, A3 (shown as dots) lie close to the bottom of potential wells: the
respective initiation sites are 62, 29 and 21 bp upstream of these minima. Comparison with
the noise parameter $\epsilon$, plotted in Fig. 1c, shows that when the motion is strongly deterministic
($|\epsilon| > 2$) it is invariably towards regions where promoter recognition can occur. The $\epsilon$ values
in the initiation region of the strongest (A1) bacterial promoter are 1.5 times greater than
anywhere else in the T7 genome.

In fact there are seven E. coli RNAP specific promoters in the T7 genome; the first recognition
sites for the six earliest are shown in Figure 2 as dots. The four minor (A0, B, C and E)
promoters, while having no recognised in vivo function, were found to have initiation sites 61
(A0 transcribes leftwards), 27, 28 and 17 bp downstream of deep minima. Figure 2 also shows
the full class I region of the T7 genome, transcribed by the bacterial RNAP, extending from
the 5' DNA end to the bacterial transcription terminator, TE. In the GenBank [54] reference
sequence (accession number NC_001604) this corresponds to sites ~500-7588. Note that other
aspects of facilitated transport, dissociation followed by "hopping" or interdomain transfer, are
likeliest to occur in locally flat regions, where the breather spends most time. In this way, the
effect of multiple, broad-bottomed wells as kinetic "traps" might be minimised.

On the other hand, the deep minimum at approximately 6 Kb could be a desirable kinetic 
trap for the host RNAP as it lies 100 bp downstream of the gene coding for T7 RNA polymerase. 
The T7 RNAP initiates transcription at one of the two specific promoters (unfilled dots at the
right side of Figure 2) and is thus responsible for the subsequent internalisation and expression 
of the remainder of the T7 genome. This deep minimum thus represents the end of the region 
where the host RNAP is "useful". One finds a similar, deep minimum at the class II/class III
interface for a wide range of breather widths which could play a similar role, inhibiting late 
transcription from weaker class II promoters in favour of class III promoters. 

3.2 Parameter variation 

Given the coarseness of the current model, it is important to understand how the results obtained
may vary with respect to the parameter values. Up to an overall scaling, all parameter variation
in the sequence-dependent part of (9) enters via the breather width, $L$, which governs sensitivity
to sequence-dependent inhomogeneities: an increase of width leads to landscapes with fewer
extrema which are also broader and larger in amplitude. The fundamental relationship governing
effects of parameter variations is therefore $\mathrm{cosec}\,\mu = L\sqrt{\bar{\beta}}$. It is natural to associate the breather
family parameter $\mu$ with the dimension of the protein-DNA interface and $L$ with the "response"
of the system for a given set of environmental conditions, encapsulated in $\eta$.

Understanding of model robustness is complicated by the way in which the sliding of RNAP
changes. For example, if breathers do play a role in the location of T7 promoters A1-A3 by
bacterial RNAP then one might expect environmental changes which alter sliding behaviour to
also influence promoter activity. It is known that the activities of A1-A3 are temperature
dependent, with A1 increasing from 20-37 °C while initiation at A2, A3 decreases under the same
circumstances.

^Figure 1 to go here
^Figure 2 to go here

Due to decreased thermal stability, one expects a greater "response" of the helix to a deformation
at increased temperature. For fixed $\mu$ this corresponds to an increase in $L$ and decrease
in $\eta$. The two graphs in Fig. 3 are calculated for such circumstances with $L = 30$, $\eta = 0.002$
and $L = 67$, $\eta = 0.0004$ respectively. For the higher $\eta$ value (on the left) a sliding RNAP is
extremely likely to fall into one of the three wells associated with a major promoter.

Conversely for the lower $\eta$ value the well containing A1 has greatly widened at the expense
of the other two. Indeed the -35 sites for A2, A3 are now situated close to local maxima and
the probability of an encounter with sliding RNAP would be significantly reduced. We note
that minima close to one or more major promoter sites exist for a broad range of parameter
values. One could argue that the overall sequence composition of the T7 initial fragment appears
to confer some robustness of host promoter recognition against environmental variations.
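The two parameter sets quoted for Fig. 3 are related by the width relation $L = \bar{\beta}^{-1/2}\,\mathrm{cosec}\,\mu$: at fixed $\mu$ and fixed sequence composition, $L$ scales as $\eta^{-1/2}$. A minimal check, with a hypothetical helper name:

```python
import math


def rescaled_width(width_ref, eta_ref, eta):
    """Breather width at fixed mu and fixed mean lambda: L scales as eta**-0.5."""
    return width_ref * math.sqrt(eta_ref / eta)


# Reducing eta fivefold from 0.002 widens a 30 bp breather to about 67 bp.
width = rescaled_width(30.0, 0.002, 0.0004)
print(round(width, 1))
```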

Having outlined the qualitative variation of the system behaviour with parameter changes,
we recompute the potential for $L$ = 24 bp. From Figure 4a it is immediately seen that for
none of the T7 promoters does the locally deepest minimum coincide with upstream recognition
sites. With replication origins $\phi$L and $\phi$R and the earliest phage promoters, $\phi$1.1A, $\phi$1.1B,
omitted, minima appear to be correlated to the start of the first downstream coding sequence,
as evidenced in Fig. 4b.

4 Discussion 

4.1 Model Assumptions 

The planar model of DNA presented is a highly simplified one, containing numerous assumptions
which are unrealistic for modelling many DNA processes: there is no explicit allowance for the
helical structure and its writhing/twisting behaviour. Many interactions with proteins involve
major, localised conformational changes of DNA; however, the specific case of sliding RNAP
may be an exception. Firstly, such conformational changes are unlikely to be present
immediately prior to closed complex formation and secondly, there is some evidence that
rates of RNAP sliding, under some conditions at least, are independent of supercoiling [32].

Another important assumption was the homogeneous, harmonic nature of the restoring
torques. In fact it is known that simple, "base content" models of helix-coil transition thermodynamics
reproduce empirical data for short (< 15 bp) DNA oligonucleotides quite well [58],
[5^]. Specifically, encapsulating sequence dependence as AT and GC contents enables reproduction
of such data at 310 K in 1 M NaCl solution (corrections due to change in salt concentration
are discussed in [SHI) with a mean (median) error of 9% (5%) (Bashford, J; unpublished). From
our previous study of the thermodynamics of B-DNA helix-coil transition we further estimate
that the enthalpies of A.T and G.C pairs are in the ratio 1.56/3, which serves to enhance
the distinction between the two types of base pair in Eq.Q. This accounts, in addition to
differing numbers of H bonds, for the averaged effects of solution, neighbouring base-pairs and
other interactions between the complementary pair.

^Figure 3 to go here 
^Figure 4 to go here 
The assumption of harmonicity for the stacking potential at large opening angles, however,
is more questionable and should be further refined. Also molecular calculations of "base-flipping"
in Watson-Crick pairs suggest opening into the major groove is more energetically
favoured for purine bases.



4.2 Breather dynamics 

The shape of the breather potential used in the qualitative arguments above depends only upon
the ratios $\eta = k/K$ and $\lambda_{A/T}/\lambda_{G/C}$. But physical properties of any breather depend on the
actual parameter values. For example, the breather energy $E$ and oscillation frequency $\omega$ may
be derived as

E = l^^VfcK, (11) 

= ji^-^)- (12) 

Using the parameter values in Ref. jJUj: K = 5 x 10~^^ J, / = 2 x 10^^'^ kg m^, in combination 
with our estimate based on data from Ref. j6Uj : k = 1 x W^"^^ J, yields t] = 4.5 x 10"'^. Thus for 
a breather of width L = 30 bp we get 

^ ~ 2.7 X 10"^^ J, cj ~ 1.0 X lO^^s^^ 

The energetic cost of creating this breather may be of the magnitude of the electrostatic attrac- 
tions responsible for the nonspecific contact. 

Concerning the size of the DNA helix deformation, we note that parameter $\mu$ provides an
estimate of the amplitude of the base-pair opening: $u_{max} = 4\mu$ when $\mu < \pi/4$. For $\bar{\beta} = 0.0045$,
as above, the amplitude for a 30 bp breather is $2\pi/3$, corresponding to individual pendulum
deformations of 60°. This parameter set does not support breathers of width less than ~ 15
bp. A variation of 20% in the value of $K$ leads to maximum deformations of 52°-65°: base pairs
are bent but not fully opened. These moderate conformational changes need not be incompatible
with an anticipated absence of large deformations 57^ accompanying nonspecific RNAP-DNA
complexes.
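The quoted amplitudes follow from $\mathrm{cosec}\,\mu = L\sqrt{\bar{\beta}}$ and $u_{max} = 4\mu$, with each of the two pendula contributing half of the opening angle. The short sketch below reproduces the arithmetic; the function name is hypothetical, and the assumption that $\bar{\beta}$ scales as $1/K$ (so that a 20% change in $K$ rescales $\bar{\beta}$ by $1/1.2$ or $1/0.8$) follows from $\beta_i = \lambda_i k/K$.

```python
import math


def opening_angle_deg(beta_bar, length):
    """Per-base deflection 2*mu in degrees, from cosec(mu) = L*sqrt(beta_bar)."""
    mu = math.asin(1.0 / (length * math.sqrt(beta_bar)))
    return math.degrees(2.0 * mu)


beta_bar, length = 0.0045, 30.0
l_min = beta_bar ** -0.5                      # narrowest supported breather, in bp
angle = opening_angle_deg(beta_bar, length)   # close to 60 degrees
# beta_bar scales as 1/K, so K -> 1.2 K and K -> 0.8 K rescale it accordingly.
angle_stiff = opening_angle_deg(beta_bar / 1.2, length)
angle_soft = opening_angle_deg(beta_bar / 0.8, length)
print(round(l_min, 1), round(angle, 1), round(angle_soft, 1), round(angle_stiff, 1))
```

The results land within about a degree of the values quoted in the text, which suggests the stated range was obtained in this way.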

The values for model parameters appearing in the literature are estimated from old experiments
on DNA homopolymers, for example Refs. [02]) [Hni, which is a difficult process. However
the main results of our paper stem from i) the shape of the potential (9) and ii) the noise parameter,
$\epsilon$, defined by (10). For these two expressions changes in the parameter $\eta$ can be offset
by "tuning" the value of $\mu$, which is a relatively free parameter. The only potentially serious
sensitivity is that of $\epsilon$ to large changes in $k$, the measure of dissociation energy for H-bonded
base pairs. Fortunately, of the three parameters this is the most reliable quantity to
estimate.



4.3 Helical model 

If the picture of sliding RNAP as a soliton-like deformation is subsequently shown to be incorrect,
the correlations observed between potential minima and promoter sites still have to be explained.
The soliton solutions of (3) preferentially move to AT-rich regions. Inspection of (9) shows
the variation due to sequence is not linear in AT content, but a first "moment", where the
contribution from each base is weighted by its position relative to the central site $X$:

$$V_{var}(X) \sim \sum_i \beta_i M(z_i), \qquad (13)$$

$$M(z) = \frac{\cosh z}{\left(\tan^2\mu + \cosh^2 z\right)^{3/2}}.$$

Curiously, this weighting function coincides with the inverse radius of curvature for a hyperbolic
curve $f(z) = \cosh z$. Such a term arises naturally in the Lorentz force experienced by a charged
particle following a curved magnetic field line. Initially consider a particle of mass $m$ and charge $q$
travelling along a uniform, straight magnetic field line. Its motion is determined by the Lorentz
equation

$$\frac{d}{dt}\mathbf{v} = \frac{q}{m}\,\mathbf{v}\times\mathbf{B}.$$

Assuming the field line lies along the $z$ axis, $\mathbf{B} = B\hat{\mathbf{e}}_z$, the velocity equation is split into parallel
and perpendicular components

$$\frac{d}{dt}v_\parallel = 0, \qquad \frac{d}{dt}\mathbf{v}_\perp = \frac{qB}{m}\,\mathbf{v}_\perp\times\hat{\mathbf{e}}_z.$$

The general solution to these equations is a helical trajectory, with time-dependent coordinates 

    x(t) = x_0 + \frac{|v_\perp|}{\omega} \sin(\omega t + \phi),

    y(t) = y_0 + \frac{|v_\perp|}{\omega} \cos(\omega t + \phi),

    z(t) = z_0 + v_\parallel t,

where (x_0, y_0, z_0) denotes the initial location of the guiding centre, φ is a phase and
ω = qB/m determines the helical frequency. This problem naturally resembles the electrostatic
sliding of a protein "particle" along the grooves of the DNA helix. Here the role of the
gyrofrequency is played by the twist of the helix, while the guiding centre of particle motion
(x_0, y_0, z(t)) corresponds to the central helical axis of the DNA.
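
The helical solution can be checked numerically. The sketch below, in plain Python, compares a finite-difference acceleration along the trajectory with ω (v × e_z), where ω = qB/m. The parameter values are arbitrary illustrative choices and do not correspond to any DNA or RNAP quantity.

```python
import math

def helix(t, x0=0.0, y0=0.0, z0=0.0, v_perp=1.0, v_par=0.5, omega=2.0, phi=0.3):
    """Position on the helical trajectory at time t."""
    r = v_perp / omega  # gyro-radius
    return (x0 + r * math.sin(omega * t + phi),
            y0 + r * math.cos(omega * t + phi),
            z0 + v_par * t)

def velocity(t, v_perp=1.0, v_par=0.5, omega=2.0, phi=0.3):
    """Analytic time derivative of helix()."""
    return (v_perp * math.cos(omega * t + phi),
            -v_perp * math.sin(omega * t + phi),
            v_par)

def cross_ez(v):
    """Cross product v x e_z for a 3-vector v."""
    return (v[1], -v[0], 0.0)

def lorentz_residual(t, omega=2.0, h=1e-5):
    """Largest component of |dv/dt - omega * (v x e_z)|, using central differences."""
    vp = velocity(t + h, omega=omega)
    vm = velocity(t - h, omega=omega)
    accel = [(a - b) / (2 * h) for a, b in zip(vp, vm)]
    force = [omega * c for c in cross_ez(velocity(t, omega=omega))]
    return max(abs(a - f) for a, f in zip(accel, force))

residuals = [lorentz_residual(0.1 * k) for k in range(100)]
print(max(residuals))

# Distance from the axis through (x0, y0) stays equal to the gyro-radius.
x, y, _ = helix(1.7)
print(math.hypot(x, y))
```

The residual is limited only by the finite-difference step, and the distance from the guiding-centre axis remains |v_perp|/ω throughout.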

Consider now the effect of introducing a curve into the helical axis: a particle travelling along 
a curved field line experiences a centrifugal force upon its guiding centre. In a local coordinate 
system this is 



    \mathbf{F}_{cf} = \frac{m v_\parallel^2}{|\mathbf{r}_c(s)|} \, \frac{\mathbf{r}_c(s)}{|\mathbf{r}_c(s)|},

where |r_c| and s denote the radius of curvature and the line element along the field line.
Similarly, let us here write an analogous expression

    F_c = -\frac{\delta}{|r_c|},    (14)

where the quantity δ has the dimensions of energy. In particular, we assume that locally the
bend can be approximated by z(ξ) = cosh ξ. Then, cf. (13), the inverse radius of curvature takes
the form of the breather weighting function.
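
The curvature statement can be made explicit with the standard formula for the curvature of a plane curve. The worked lines below are a sketch: the amplitude a, and in particular the choice a = cos μ in the last line, are assumptions introduced here for illustration and are not stated in the text.

```latex
% Curvature of the plane curve z(\xi) = \cosh\xi
\frac{1}{|r_c(\xi)|}
  = \frac{|z''(\xi)|}{\bigl(1 + z'(\xi)^2\bigr)^{3/2}}
  = \frac{\cosh\xi}{\bigl(1 + \sinh^2\xi\bigr)^{3/2}}
  = \operatorname{sech}^2\xi .

% Rescaled curve z(\xi) = a\cosh\xi (a is an illustrative amplitude)
\frac{1}{|r_c(\xi)|}
  = \frac{a\cosh\xi}{\bigl(1 + a^2\sinh^2\xi\bigr)^{3/2}} ,
\qquad
a = \cos\mu \;\Rightarrow\;
\frac{1}{|r_c(\xi)|}
  = \sec^2\mu \,
    \frac{\cosh\xi}{\bigl(\tan^2\mu + \cosh^2\xi\bigr)^{3/2}} .
```

For unit amplitude the curvature is sech^2 ξ, the small tan^2 μ form of the weighting function. With the assumed amplitude cos μ, the full breather weighting is recovered up to the constant factor sec^2 μ.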

It follows that in the continuum limit the time-averaged breather potential could also be thought
of as the work done by a "centrifugal force" on a sliding RNAP as it navigates a bend in the
helix. Therefore the effective potential can conceivably be arrived at via simple considerations of
thermal stability (in a planar model) or bending deformations (in a helical model), two of the
most commonly suggested mechanisms for enhancing promoter recognition.

4.4 Superhelicity

A mechanism of localised DNA deformation with demonstrated biological significance [18]
is that of superhelical stress-induced DNA denaturation (SSID). Roles for SSID in
gene regulation have been proposed ^Hl in regard to both open complex formation and
transcription. In the former instance, promoter sites are easily destabilized by superhelical stress.
In the latter, the action of local helix unwinding by transcribing RNAP results in waves of
positive (negative) superhelicity propagating downstream (upstream) of the transcription complex.
Computation of SSID profiles indicates ^Hl) [00] that AT-rich regions upstream (downstream) of
the 5' (3') ends of transcription units are prone to localised over- or under-winding, acting as a
possible "sink" for propagating superhelicity and ensuring smooth transcription.

The breather potential, which also picks out regions of AT richness, shows that transcription units
of at least 10^ bp in length are often demarcated by minima, in agreement with the above 
observations. This is especially the case for the 3' ends of T7 genes 1 and 6, the last genes in 
class I and II regions respectively. In these instances the AT richness may also confer extra
rigidity, making these suitable pause sites in the stepwise internalisation of the phage genome,
or, as mentioned above, act as a kinetic trap used in inhibiting class I or II transcription.

4.5 Correlations 

In reporting promoter-extrema correlations two points should be kept in mind. Firstly, the 
assumed breather widths coincide with the sizes of the elongation RNAP-DNA complexes. 
Therefore potential minima could be indicative of deformation associated with transcription, 
as appears to be the case for T7 phage promoters, shown in Figure 4. Regarding nonspecific 
complexes, the values of 30 and 24 bp should be considered as upper bounds for an
experimentally undetermined quantity. The correlations reported in this study persist for
widths in the ranges 20 to 30 bp and 18 to 24 bp respectively. For sizes less than 18 bp, the
increasing roughness of the potential causes difficulty in identifying correlations.

The second caveat is that only correlations between promoter initiation and the deepest 
local minimum have been considered. For some T7 promoters shallow upstream wells also exist. 
Moreover the effect of thermal noise has not been considered. Only with full dynamical
simulations can connections between the local topography of the potential and facilitated target
location be properly studied.
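
The distinction between the deepest local minimum and a shallower, nearer well can be illustrated with a short Python sketch. The profile values, the site positions and the search window are all hypothetical and chosen only to show the bookkeeping; they are not T7 coordinates or computed potentials.

```python
def local_minima(profile):
    """Indices i where profile[i] lies strictly below both neighbours."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] < profile[i - 1] and profile[i] < profile[i + 1]]

def deepest_minimum_near(profile, site, window):
    """Deepest local minimum within `window` sites of `site`, or None if there is none."""
    nearby = [i for i in local_minima(profile) if abs(i - site) <= window]
    return min(nearby, key=lambda i: profile[i]) if nearby else None

# Toy profile: a shallow well at index 2 and a deeper well at index 6.
toy = [0.0, -0.2, -0.5, -0.1, 0.0, -0.8, -1.0, -0.3, 0.0]

# Hypothetical initiation sites (illustrative positions only).
sites = [3, 7]
offsets = {s: deepest_minimum_near(toy, s, window=4) - s for s in sites}
print(offsets)  # {3: 3, 7: -1}
```

For the site at index 3 the shallow well at index 2 is closer, but the deepest-minimum rule selects index 6. This is the kind of case the second caveat refers to.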

It is difficult to see how kink solutions of the planar model previously considered [6]-[11]
might mimic physical profiles of base-pair opening. Kinks will also move preferentially to AT rich 
regions, presumably the reason why promoter sequences Ai 1^1, ^3 and ^0 iZj were concluded 
to be "dynamically active". The unit-mass potential for kinks, initially at rest, moving in a 
slowly-varying background was derived by Salerno and Kivshar [S]. The sequence variation is
contained in a term analogous to (13); however, the weighting function is

    W_k(z) = \operatorname{sech}^2 z.

This coincides with the breather function for small tan^2 μ, illustrating why similar results for
the major T7 promoter sequences are obtained for both kink [Sj-jH] and breather solitons. 
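
The small tan^2 μ limit can be checked numerically. The Python sketch below compares the breather weighting cosh z / (tan^2 μ + cosh^2 z)^(3/2) with the kink weighting sech^2 z on a grid of z values. The grid and the sample values of μ are arbitrary choices for illustration.

```python
import math

def w_breather(z, mu):
    """Breather weighting: cosh z / (tan^2 mu + cosh^2 z)^(3/2)."""
    c = math.cosh(z)
    return c / (math.tan(mu) ** 2 + c * c) ** 1.5

def w_kink(z):
    """Kink weighting: sech^2 z."""
    return 1.0 / math.cosh(z) ** 2

def max_gap(mu, zs):
    """Largest absolute difference between the two weightings on the grid zs."""
    return max(abs(w_breather(z, mu) - w_kink(z)) for z in zs)

zs = [0.1 * k for k in range(-50, 51)]  # z from -5 to 5
for mu in (0.5, 0.1, 0.01):
    print(mu, max_gap(mu, zs))
```

The difference shrinks as μ decreases and vanishes at μ = 0, where the two functions are identical.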

5 Conclusion 

In this paper we have re-examined Salerno's nonlinear DNA model, postulating a role for
localised soliton excitations in approximating the sliding component of facilitated target location
of RNA polymerase. We found that such deformations would involve moderate bending of
individual base pairs and that their energy of translocation is consistent with a picture of noisy,
deterministic dynamics. Both of these observations are also consistent with current, limited 
knowledge of RNAP sliding and nonspecific complexes. A qualitative correspondence of these 
solitons and localised bending in a helical model was also demonstrated. 

The dynamical picture of sliding which emerged also suggests that the random/deterministic 
nature of the motion is sequence-dependent, with translocation in relatively homogeneous regions 
being effectively random. The corollary, that interplay between adjacent random and determin- 
istic regions could constitute a search "algorithm" , is speculative and, we believe, merits further 
investigation. 

Our analysis of the T7 genome showed good correlations between AT-rich regions and the 
recognition sites of host-specific promoters used for early phage transcription. For phage-specific 
promoters, regions of maximal AT-richness correlated with the start of the coding sequence 
immediately downstream. As discussed above this may be connected with transcription and 
while there is no obvious correlation with recognition sites, a full description of facilitated target 
location needs to account for the thermal background. This is a subject of current investigation. 

We note that there has been a suggestion (B^ that virion proteins injected into the host cell
with the initial T7 fragment may i) inhibit the nonspecific binding of restriction enzymes and 
other proteins to DNA; ii) have an affinity for E. coli RNAP, negating the requirement for direct 
promoter recognition in vivo. Similarly, inhibition of class I and II transcription is known to be 
performed by T7 gene products: kinase (gene 0.7) and lysozyme (gene 3.5) respectively. 

However, we see similar correlations for the UP and σ^70 sites of bacterial promoters in other
members of the T7 viral supergroup, in addition to genomes of the unrelated phages T4 and
T5 (see Figure 5). This may be suggestive of a mechanism at work to enhance promoter
recognition/inhibition in lytic phage genomes, although in the presence of functional proteins this
mechanism can be relegated to an auxiliary role, such as in T7.

It is important to investigate whether planar base-flipping/helical bending deformation
patterns can be used to simulate protein-DNA interactions in DNA sequence analysis. The
correlations reported here, to our knowledge for the first time, could have been made via other
"nonlinear" analyses of AT content, had a motivation been apparent. Propagation of breathers
in a nonlinear, toy model of DNA provides a source for such motivation. It may be that herein
lies the true value of a model with such a controversial history.

6 Figure Captions 

Figure 1: a) Effective potential for breathers in the initial T7 virion fragment. Initial
binding sites for bacterial promoters are denoted by dots; b) Noise parameter ε(X) for
the same sequence; c) Evolution over 1000 time-steps of the system with breathers
initially placed at sites 460, 570 and 680.

Figure 2: Effective potential for 30 bp wide breathers in the class I region of the
T7 genome. Filled and unfilled dots denote respectively UP or -35 E. coli and +1 T7
promoter sites.

Figure 3: Potential computed for the T7 initial fragment for μ = π/6.05. a)
η = 0.002, L = 30 bp; b) η = 0.0004, L = 67 bp. Dots denote, from left to right,
UP and -35 sites for Ai — bacterial promoters. 

Figure 4: a) Location of minima of the potential nearest initiation sites of T7 phage promoters; b)
Scatter plot of initiation to downstream transcription unit distance (TU) versus initiation
to minimum distance (Min).

Figure 5: Representative region of T5 genome potential, showing correlations between
potential minima and -35 sites for E. coli promoters (L = 30 bp).